# Vision-Language-Action Model
## Nora Long

declare-lab · Multimodal Fusion · Transformers · 673 downloads · 5 likes

A vision-language-action model trained on the Open X-Embodiment dataset that generates robot actions from language instructions and camera images.
## Pi0fast Base

lerobot · Apache-2.0 · Multimodal Fusion · 1,372 downloads · 12 likes

π0+FAST is a vision-language-action model from Physical Intelligence built on the FAST efficient action tokenization scheme, making it well suited to robotic vision-language-action tasks.
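As a rough starting point, the checkpoint can be pulled locally with huggingface_hub; the repo id `lerobot/pi0fast_base` is assumed from the listing above, and running the policy itself goes through the LeRobot library.

```python
# Sketch: fetch the pi0+FAST checkpoint with huggingface_hub.
# "lerobot/pi0fast_base" is an assumed repo id inferred from this listing;
# the policy is then loaded and executed through the LeRobot library itself.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="lerobot/pi0fast_base")
print("checkpoint downloaded to:", local_dir)
```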
## Jarvisvla Qwen2 VL 7B

CraftJarvis · MIT · Image-to-Text · Transformers · English · 163 downloads · 8 likes

A vision-language-action model designed specifically for Minecraft, capable of executing thousands of in-game skills from human language commands.
## Spatialvla 4b 224 Sft Fractal

IPEC-COMMUNITY · MIT · Text-to-Image · Transformers · English · 375 downloads · 0 likes

SpatialVLA is a vision-language-action model fine-tuned on the Fractal dataset, used primarily for robot control tasks.
## Spatialvla 4b 224 Sft Bridge

IPEC-COMMUNITY · MIT · Text-to-Image · Transformers · English · 1,066 downloads · 0 likes

A vision-language-action model fine-tuned from the SpatialVLA base model on the Bridge dataset, targeted at the Simpler-env benchmark.
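For either SpatialVLA checkpoint, a hedged loading sketch with the Transformers auto classes is shown below. The repo id is inferred from this listing, and the actual action-prediction call is defined by the repository's own custom modeling code, so this only covers loading.

```python
# Sketch: load a SpatialVLA checkpoint with the Transformers auto classes.
# The repo id is assumed from this listing; the repository ships custom
# modeling code, hence trust_remote_code=True. The action-prediction
# interface is defined by that custom code (see the model card).
import torch
from transformers import AutoModel, AutoProcessor

model_id = "IPEC-COMMUNITY/spatialvla-4b-224-sft-bridge"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()
```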
## Cogact Small

CogACT · MIT · Multimodal Fusion · Transformers · English · 405 downloads · 4 likes

The small variant of CogACT, a vision-language-action (VLA) architecture derived from vision-language models (VLMs) and designed for robot manipulation.
## Cogact Large

CogACT · MIT · Multimodal Fusion · Transformers · English · 122 downloads · 3 likes

The large variant of CogACT, a vision-language-action (VLA) architecture derived from vision-language models (VLMs) and designed for robot manipulation.
## Cogact Base

CogACT · MIT · Multimodal Fusion · Transformers · English · 6,589 downloads · 12 likes

CogACT is a vision-language-action (VLA) architecture that combines a vision-language model with a specialized action module for robotic manipulation tasks.
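To make the architecture description above concrete, here is a purely illustrative control loop showing how a VLA policy such as CogACT is typically used: an image and a language instruction go in, a low-level action command comes out, and the loop repeats at the control rate. Every function name below is a hypothetical stand-in, not part of the CogACT codebase.

```python
# Illustrative only: a generic VLA control loop with hypothetical stand-ins
# (get_camera_frame, vla_policy, send_to_robot). None of these names come
# from CogACT; consult the model card for its real API.
import numpy as np

def get_camera_frame() -> np.ndarray:
    # Stand-in for a camera driver; returns a dummy 224x224 RGB frame.
    return np.zeros((224, 224, 3), dtype=np.uint8)

def vla_policy(image: np.ndarray, instruction: str) -> np.ndarray:
    # Stand-in for the VLA model: maps (image, instruction) to a 7-DoF
    # action (end-effector delta pose + gripper), a common VLA convention.
    return np.zeros(7, dtype=np.float32)

def send_to_robot(action: np.ndarray) -> None:
    # Stand-in for a robot interface; here we just print the command.
    print("commanded action:", action)

instruction = "pick up the red block"
for _ in range(10):  # run a few steps of the closed-loop policy
    frame = get_camera_frame()
    action = vla_policy(frame, instruction)
    send_to_robot(action)
```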